Auto Repair Shop Demand Planning

MGT 6203: Data Analytics in Business
Team 45
Jeff Hedberg, Lisa Chille, Nick Cunningham, Brittany Lange, Delband Taha
July 06, 2022

Introduction

Background information

A hypothetical auto repair shop may struggle with demand forecasting because no one plans on getting into an accident: none of its customers intend to use its services until they suddenly face the challenge of getting their car repaired1. By statistically analyzing and making inferences from collision data, the shop can better understand its customers’ patterns and make the appropriate hiring/firing and physical expansion/subletting decisions as necessary. With a little more effort, it can even learn which demographic(s) to raise awareness with so that, in the event of a crash, this repair shop is top of mind for the affected parties.

This project aims to support the annual planning process of a hypothetical repair shop in New York state through careful data analysis and inference, leading to recommendations for the auto repair shop owner(s).

Project motivation

Our objective with this project is to use New York motor vehicle collision reports compiled by the New York City Police Department (NYPD) to help a fictitious auto repair center in New York state estimate the volume of incoming vehicles it can expect to repair in the coming year, assuming its market share is known. Our analysis can help this repair center predict the staffing levels it will need to maintain and identify potential opportunities for expansion. We also aim to analyze the demographics of those involved in collisions and identify which groups contribute significantly to car collisions. We can then propose and measure the impact of a marketing campaign for this repair center.

Assumptions

We assume equal market share for the 3164 motor vehicle repair shops found in New York City (Motor Vehicles). This gives an average market share of 0.0316% per shop, which we will use to estimate the volume of incoming vehicles the shop can expect.
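The equal-share figure follows directly from the shop count; a quick base-R check (a sketch using only the 3164 count cited above):

```r
# Equal market share: one shop's fraction of all NYC collision repairs
number_of_repair_shops <- 3164
share <- 1 / number_of_repair_shops
round(100 * share, 4)  # as a percentage: 0.0316
```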

Literature review

Research conducted by Songtao He et al. at MIT’s Computer Science and Artificial Intelligence Laboratory and the Qatar Center for Artificial Intelligence helped us set direction for this project. Using historical crash data from four US cities (Los Angeles, New York City, Chicago, and Boston), combined with GPS trajectories, satellite images, and road maps, they developed a model that predicts the number of future crashes in high-risk areas on a high-resolution map. They evaluated their model on crash data from 2017 and 2018, then used the next two years to test its performance at predicting crashes. Ultimately, we wanted to build models for a potential real-world business application, so we chose a project with a fictitious business and followed a similar methodology for splitting our data and for evaluating and testing our models.

Planned approach

Our approach is detailed in this document. We intend to:

  • Clean the data
  • Use linear regression as well as some tree-based regression approaches to describe and model the trends in the collision data.
  • Use the highest performing model to predict the volume of collisions in the city.
  • Use this prediction to extrapolate the demand that the repair shop will have.

The models we intend to use are also detailed in this report.

Preparation

Useful libraries

First, we import necessary libraries. These libraries will be used as follows:

Library Use
dplyr Data manipulation
knitr RMarkdown knitting
kableExtra Kable tables
plotly Plotting
lubridate Date manipulation
randomForest Building random forest models
rpart Building CART models
rpart.plot Plotting regression trees
library(dplyr)
library(knitr)
library(kableExtra)
library(plotly)
library(lubridate)
library(randomForest)
library(rpart)
library(rpart.plot)

Data Wrangling

Summaries

Below, we import data from 3 sources and view the raw summaries. The data are related to each other as shown in the accompanying data-relationships diagram.

Helpers

These functions will be reused in our endeavour to read and summarise the data at hand.

read_data <- function(path) {
  data <- read.csv(path, stringsAsFactors = FALSE)
  colnames(data)[colnames(data) == 'CRASH.DATE'] <- 'CRASH_DATE'
  data$CRASH_DATE <- as.Date(data$CRASH_DATE, "%m/%d/%Y")
  data
}
tabulate_collisions <- function(data) {
  kable(t(summary(data))) %>% kable_classic(full_width = TRUE, html_font = "Cambria",
                                               font_size = 14)
}


Crashes

crashes_df <- read_data('./Crashes.csv') #1,896,229 x 29
tabulate_collisions(crashes_df)
CRASH_DATE Min. :2012-07-01 1st Qu.:2014-10-28 Median :2016-12-15 Mean :2017-01-01 3rd Qu.:2019-01-04 Max. :2022-05-29 NA
CRASH.TIME Length:1896229 Class :character Mode :character NA NA NA NA
BOROUGH Length:1896229 Class :character Mode :character NA NA NA NA
ZIP.CODE Min. :10000 1st Qu.:10306 Median :11207 Mean :10837 3rd Qu.:11237 Max. :11697 NA’s :587695
LATITUDE Min. : 0.00 1st Qu.:40.67 Median :40.72 Mean :40.64 3rd Qu.:40.77 Max. :43.34 NA’s :220042
LONGITUDE Min. :-201.36 1st Qu.: -73.98 Median : -73.93 Mean : -73.77 3rd Qu.: -73.87 Max. : 0.00 NA’s :220042
LOCATION Length:1896229 Class :character Mode :character NA NA NA NA
ON.STREET.NAME Length:1896229 Class :character Mode :character NA NA NA NA
CROSS.STREET.NAME Length:1896229 Class :character Mode :character NA NA NA NA
OFF.STREET.NAME Length:1896229 Class :character Mode :character NA NA NA NA
NUMBER.OF.PERSONS.INJURED Min. : 0.0000 1st Qu.: 0.0000 Median : 0.0000 Mean : 0.2873 3rd Qu.: 0.0000 Max. :43.0000 NA’s :18
NUMBER.OF.PERSONS.KILLED Min. :0.000000 1st Qu.:0.000000 Median :0.000000 Mean :0.001358 3rd Qu.:0.000000 Max. :8.000000 NA’s :31
NUMBER.OF.PEDESTRIANS.INJURED Min. : 0.00000 1st Qu.: 0.00000 Median : 0.00000 Mean : 0.05304 3rd Qu.: 0.00000 Max. :27.00000 NA
NUMBER.OF.PEDESTRIANS.KILLED Min. :0.000000 1st Qu.:0.000000 Median :0.000000 Mean :0.000697 3rd Qu.:0.000000 Max. :6.000000 NA
NUMBER.OF.CYCLIST.INJURED Min. :0.00000 1st Qu.:0.00000 Median :0.00000 Mean :0.02435 3rd Qu.:0.00000 Max. :4.00000 NA
NUMBER.OF.CYCLIST.KILLED Min. :0.0000000 1st Qu.:0.0000000 Median :0.0000000 Mean :0.0001007 3rd Qu.:0.0000000 Max. :2.0000000 NA
NUMBER.OF.MOTORIST.INJURED Min. : 0.0000 1st Qu.: 0.0000 Median : 0.0000 Mean : 0.2083 3rd Qu.: 0.0000 Max. :43.0000 NA
NUMBER.OF.MOTORIST.KILLED Min. :0.00000 1st Qu.:0.00000 Median :0.00000 Mean :0.00055 3rd Qu.:0.00000 Max. :5.00000 NA
CONTRIBUTING.FACTOR.VEHICLE.1 Length:1896229 Class :character Mode :character NA NA NA NA
CONTRIBUTING.FACTOR.VEHICLE.2 Length:1896229 Class :character Mode :character NA NA NA NA
CONTRIBUTING.FACTOR.VEHICLE.3 Length:1896229 Class :character Mode :character NA NA NA NA
CONTRIBUTING.FACTOR.VEHICLE.4 Length:1896229 Class :character Mode :character NA NA NA NA
CONTRIBUTING.FACTOR.VEHICLE.5 Length:1896229 Class :character Mode :character NA NA NA NA
COLLISION_ID Min. : 22 1st Qu.:3046695 Median :3584305 Mean :3021392 3rd Qu.:4058626 Max. :4533068 NA
VEHICLE.TYPE.CODE.1 Length:1896229 Class :character Mode :character NA NA NA NA
VEHICLE.TYPE.CODE.2 Length:1896229 Class :character Mode :character NA NA NA NA
VEHICLE.TYPE.CODE.3 Length:1896229 Class :character Mode :character NA NA NA NA
VEHICLE.TYPE.CODE.4 Length:1896229 Class :character Mode :character NA NA NA NA
VEHICLE.TYPE.CODE.5 Length:1896229 Class :character Mode :character NA NA NA NA


Person

person_df <- read_data('./Person.csv') #4,692,054 x 21
tabulate_collisions(person_df)
UNIQUE_ID Min. : 10922 1st Qu.: 6812186 Median : 8996148 Mean : 8531863 3rd Qu.:10216281 Max. :12239058 NA
COLLISION_ID Min. : 37 1st Qu.:3638855 Median :3921823 Mean :3853306 3rd Qu.:4210666 Max. :4533068 NA
CRASH_DATE Min. :2012-07-01 1st Qu.:2017-03-19 Median :2018-06-08 Mean :2018-07-08 3rd Qu.:2019-09-20 Max. :2022-05-29 NA
CRASH_TIME Length:4692054 Class :character Mode :character NA NA NA NA
PERSON_ID Length:4692054 Class :character Mode :character NA NA NA NA
PERSON_TYPE Length:4692054 Class :character Mode :character NA NA NA NA
PERSON_INJURY Length:4692054 Class :character Mode :character NA NA NA NA
VEHICLE_ID Min. : 123423 1st Qu.:17466247 Median :18528882 Mean :18253620 3rd Qu.:19125401 Max. :20229580 NA’s :185684
PERSON_AGE Min. :-999.0 1st Qu.: 23.0 Median : 35.0 Mean : 36.8 3rd Qu.: 50.0 Max. :9999.0 NA’s :453265
EJECTION Length:4692054 Class :character Mode :character NA NA NA NA
EMOTIONAL_STATUS Length:4692054 Class :character Mode :character NA NA NA NA
BODILY_INJURY Length:4692054 Class :character Mode :character NA NA NA NA
POSITION_IN_VEHICLE Length:4692054 Class :character Mode :character NA NA NA NA
SAFETY_EQUIPMENT Length:4692054 Class :character Mode :character NA NA NA NA
PED_LOCATION Length:4692054 Class :character Mode :character NA NA NA NA
PED_ACTION Length:4692054 Class :character Mode :character NA NA NA NA
COMPLAINT Length:4692054 Class :character Mode :character NA NA NA NA
PED_ROLE Length:4692054 Class :character Mode :character NA NA NA NA
CONTRIBUTING_FACTOR_1 Length:4692054 Class :character Mode :character NA NA NA NA
CONTRIBUTING_FACTOR_2 Length:4692054 Class :character Mode :character NA NA NA NA
PERSON_SEX Length:4692054 Class :character Mode :character NA NA NA NA


Vehicles

vehicles_df <- read_data('./Vehicles.csv') #3,704,406 x 25
tabulate_collisions(vehicles_df)
UNIQUE_ID Min. : 111711 1st Qu.:14215160 Median :17306058 Mean :16060871 3rd Qu.:18739205 Max. :20121717 NA
COLLISION_ID Min. : 22 1st Qu.:3017853 Median :3567068 Mean :2996659 3rd Qu.:4028145 Max. :4484197 NA
CRASH_DATE Min. :2012-07-01 1st Qu.:2014-10-15 Median :2016-11-18 Mean :2016-11-21 3rd Qu.:2018-11-15 Max. :2021-12-04 NA
CRASH_TIME Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_ID Length:3704406 Class :character Mode :character NA NA NA NA
STATE_REGISTRATION Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_TYPE Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_MAKE Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_MODEL Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_YEAR Min. : 1000 1st Qu.: 2008 Median : 2013 Mean : 2015 3rd Qu.: 2016 Max. :20063 NA’s :1796971
TRAVEL_DIRECTION Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_OCCUPANTS Min. :0.00e+00 1st Qu.:1.00e+00 Median :1.00e+00 Mean :1.01e+03 3rd Qu.:1.00e+00 Max. :1.00e+09 NA’s :1718406
DRIVER_SEX Length:3704406 Class :character Mode :character NA NA NA NA
DRIVER_LICENSE_STATUS Length:3704406 Class :character Mode :character NA NA NA NA
DRIVER_LICENSE_JURISDICTION Length:3704406 Class :character Mode :character NA NA NA NA
PRE_CRASH Length:3704406 Class :character Mode :character NA NA NA NA
POINT_OF_IMPACT Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_DAMAGE Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_DAMAGE_1 Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_DAMAGE_2 Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_DAMAGE_3 Length:3704406 Class :character Mode :character NA NA NA NA
PUBLIC_PROPERTY_DAMAGE Length:3704406 Class :character Mode :character NA NA NA NA
PUBLIC_PROPERTY_DAMAGE_TYPE Length:3704406 Class :character Mode :character NA NA NA NA
CONTRIBUTING_FACTOR_1 Length:3704406 Class :character Mode :character NA NA NA NA
CONTRIBUTING_FACTOR_2 Length:3704406 Class :character Mode :character NA NA NA NA


Cleaning and combining data sets

As a result of a traffic safety initiative to eliminate traffic fatalities, the NYPD replaced its record management system with an electronic one (FORMS) in 2016 (“Motor Vehicle Collisions - Crashes”). We see this change reflected in the chart below.

monthly_agg1 <- (person_df %>% 
                   mutate(yr_mo = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7))) %>% 
                   group_by(yr_mo) %>% 
                   summarise(n = n()) %>% 
                   arrange(yr_mo)
                 )
plot_ly(x=monthly_agg1$yr_mo, y=monthly_agg1$n, type='bar') %>%
  layout(title = "Crash Data by Month from 2012-2022", xaxis = list(title = 'Month'), yaxis = list(title = 'Number of Crashes'))

Note that the amount of data collected from March 2016 onward greatly surpasses the amount previously collected. To control for the change in data recording systems, we will use only data collected from January 1, 2017 onward for full-year modeling.

We are now ready to filter the data to select columns of interest. We will also combine data from our three sources into one place, using their common factors. A summary can be found below.

combined_df <- (crashes_df %>% 
                  select(-c(CRASH_DATE, CRASH.TIME, CONTRIBUTING.FACTOR.VEHICLE.1, CONTRIBUTING.FACTOR.VEHICLE.2,
                            CONTRIBUTING.FACTOR.VEHICLE.3, CONTRIBUTING.FACTOR.VEHICLE.4, CONTRIBUTING.FACTOR.VEHICLE.5,
                            VEHICLE.TYPE.CODE.4, VEHICLE.TYPE.CODE.5,
                            NUMBER.OF.PEDESTRIANS.INJURED, NUMBER.OF.PEDESTRIANS.KILLED,
                            NUMBER.OF.CYCLIST.INJURED, NUMBER.OF.CYCLIST.KILLED,
                            ON.STREET.NAME, CROSS.STREET.NAME, OFF.STREET.NAME)) %>% 
                  inner_join(vehicles_df %>% 
                               filter(!is.na(VEHICLE_ID)) %>% 
                               select(-c(CRASH_DATE, CRASH_TIME, VEHICLE_ID))
                             , by = 'COLLISION_ID') %>% 
                  inner_join(person_df %>% 
                               select(-c(UNIQUE_ID, CONTRIBUTING_FACTOR_1, CONTRIBUTING_FACTOR_2))
                             , by= c('COLLISION_ID'='COLLISION_ID', 'UNIQUE_ID'='VEHICLE_ID')) %>% 
                  filter((PED_ROLE %in% c('Driver'))) %>%  #drivers only
                  filter((CRASH_DATE >= '2017-01-01') & (CRASH_DATE < '2021-12-01')) %>% #only from 2017-01-01 to 2021-11-30
                  filter(PERSON_AGE > 14 & PERSON_AGE< 101) %>% 
                  filter(PERSON_SEX == 'F' | PERSON_SEX=='M') %>% 
                  mutate(MONTH = substr(CRASH_DATE,6,7)) %>% 
                  mutate(TIMESTEP = as.period(interval(as.Date('2017-01-01'), CRASH_DATE)) %/% months(1)) %>% 
                  mutate(yr_mo = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7))) %>%
                  select(COLLISION_ID, CRASH_DATE, PERSON_AGE, PERSON_SEX, MONTH, TIMESTEP, yr_mo) %>% 
                  arrange(CRASH_DATE)
                )
kable(t(summary(combined_df))) %>% kable_classic(full_width = TRUE, html_font = "Cambria", font_size = 14)
COLLISION_ID Min. :3589945 1st Qu.:3805149 Median :4017960 Mean :4023043 3rd Qu.:4234930 Max. :4484186
CRASH_DATE Min. :2017-01-01 1st Qu.:2017-12-04 Median :2018-11-05 Mean :2018-12-28 3rd Qu.:2019-11-01 Max. :2021-11-30
PERSON_AGE Min. : 15.00 1st Qu.: 29.00 Median : 40.00 Mean : 41.69 3rd Qu.: 53.00 Max. :100.00
PERSON_SEX Length:1339106 Class :character Mode :character NA NA NA
MONTH Length:1339106 Class :character Mode :character NA NA NA
TIMESTEP Min. : 0.00 1st Qu.:11.00 Median :22.00 Mean :23.44 3rd Qu.:34.00 Max. :58.00
yr_mo Length:1339106 Class :character Mode :character NA NA NA

Data Exploration

Effect of COVID-19

We create our aggregate of volume data available for modeling and visualize it with a plot. It is interesting to note the impact of COVID-19 on crash volume beginning in March 2020.

monthly_agg2 <- (combined_df %>% 
                   group_by(TIMESTEP, yr_mo, MONTH) %>% 
                   summarise(n = n(), .groups = 'drop') %>% 
                   arrange(TIMESTEP, yr_mo, MONTH)
                 )
plot_ly(x=monthly_agg2$yr_mo, y=monthly_agg2$n, type='bar') %>%
  layout(title = "Crash Data by Month from 2017-2021", xaxis = list(title = 'Month'), yaxis = list(title = 'Number of Crashes'))

Modeling

Splitting into training and testing data sets

We now create a coarser aggregate for modeling, grouped by our features of interest, and separate the data into training and test sets. Since we are dealing with a temporal model, we use all but the final year for training and reserve the final year for testing; a random split would be nonsensical for our purposes.

monthly_agg3 <- (combined_df %>% 
                   group_by(TIMESTEP, yr_mo, MONTH, PERSON_SEX, PERSON_AGE) %>% 
                   summarise(n = n(), .groups = 'drop') %>% 
                   arrange(TIMESTEP, yr_mo, MONTH, PERSON_SEX, PERSON_AGE)
)

train_df <- monthly_agg3 %>% filter(TIMESTEP <= 47) # <= 2020-12
test_df <- monthly_agg3 %>% filter(TIMESTEP > 47) # > 2020-12

Linear Regression

Training

lm_model <- lm(n~TIMESTEP+MONTH+PERSON_AGE+PERSON_SEX, data = train_df)
lm_summary <- summary(lm_model)
lm_summary
## 
## Call:
## lm(formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, 
##     data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -441.83  -64.83    4.23   58.78  299.83 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 352.82598    5.76802  61.169  < 2e-16 ***
## TIMESTEP     -2.92530    0.09545 -30.646  < 2e-16 ***
## MONTH02     -10.16351    6.23069  -1.631 0.102889    
## MONTH03       5.85024    6.23280   0.939 0.347955    
## MONTH04     -17.30059    6.27380  -2.758 0.005837 ** 
## MONTH05      13.68088    6.25906   2.186 0.028863 *  
## MONTH06      19.15099    6.25343   3.062 0.002203 ** 
## MONTH07      13.20462    6.25123   2.112 0.034691 *  
## MONTH08      12.80755    6.26343   2.045 0.040908 *  
## MONTH09      18.12040    6.27354   2.888 0.003883 ** 
## MONTH10      27.72229    6.28431   4.411 1.04e-05 ***
## MONTH11      22.61062    6.30315   3.587 0.000336 ***
## MONTH12      23.03260    6.32483   3.642 0.000273 ***
## PERSON_AGE   -3.92660    0.05504 -71.344  < 2e-16 ***
## PERSON_SEXM 150.90222    2.55012  59.175  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 111.3 on 7619 degrees of freedom
## Multiple R-squared:  0.5457, Adjusted R-squared:  0.5449 
## F-statistic: 653.8 on 14 and 7619 DF,  p-value: < 2.2e-16
R_sq_lm_train <- lm_summary$r.squared
R_sq_lm_train
## [1] 0.5457191
SST_lm_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_lm_train <- sum((train_df$n - predict(lm_model, newdata = train_df))^2)
R_sq_lm_train <- 1-(SSE_lm_train/SST_lm_train)
R_sq_lm_train
## [1] 0.5457191

Visualisation

predictions <- predict(lm_model, newdata = train_df)
plot_ly(x=train_df$yr_mo, y=(predictions-train_df$n), type='scatter', mode='markers') %>% 
  layout(title = 'Plot of Linear Model Residuals vs yr_mo for Training Data',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'LM Residuals', rangemode = "tozero"))

Test Metrics

SST_lm_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_lm_test <- sum((test_df$n - predict(lm_model, newdata = test_df))^2)
R_sq_lm_test <- 1-(SSE_lm_test/SST_lm_test)
R_sq_lm_test
## [1] 0.08364779
plot_ly(x=test_df$yr_mo, y=(predict(lm_model, newdata = test_df)-test_df$n), type='scatter', mode='markers') %>% 
  layout(title = 'Plot of Linear Model Residuals vs yr_mo for Test Data',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'LM Residuals', rangemode = "tozero"))

Our linear model achieved an R-squared value of 0.55 on training data, but its performance on test data dropped the R-squared value to 0.08.
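The SST/SSE computation above is repeated for each model in this report; a small base-R helper (a sketch — `r_squared` is a name we introduce here, not part of the original script) consolidates the recipe:

```r
# Out-of-sample R^2 = 1 - SSE/SST, with SST taken around the mean of the
# actual values in the evaluation set (the convention used throughout)
r_squared <- function(actual, predicted) {
  sst <- sum((actual - mean(actual))^2)
  sse <- sum((actual - predicted)^2)
  1 - sse / sst
}
```

For example, `r_squared(test_df$n, predict(lm_model, newdata = test_df))` would reproduce `R_sq_lm_test`.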

Random forest

Training

set.seed(12345)
rf_model <- randomForest(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df, importance=TRUE, ntree=110)
summary(rf_model)
##                 Length Class  Mode     
## call               5   -none- call     
## type               1   -none- character
## predicted       7634   -none- numeric  
## mse              110   -none- numeric  
## rsq              110   -none- numeric  
## oob.times       7634   -none- numeric  
## importance         8   -none- numeric  
## importanceSD       4   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            11   -none- list     
## coefs              0   -none- NULL     
## y               7634   -none- numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
var_importance_df <- data.frame(importance(rf_model)) %>% rename('PCT_IncMSE'='X.IncMSE') %>% arrange(desc(PCT_IncMSE))
var_importance_df
SST_rf_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_rf_train <- sum((train_df$n - predict(rf_model, newdata = train_df))^2)
R_sq_rf_train <- 1-(SSE_rf_train/SST_rf_train)
R_sq_rf_train
## [1] 0.8639321

Visualisation

plot_ly(x=seq(1,length(rf_model$mse),1), y=rf_model$mse, mode='lines+markers', type='scatter') %>% 
  layout(title = 'Plot of Random Forest Training Error vs Number of Trees',
         xaxis = list(title = 'Number of Trees'),
         yaxis = list(title = 'Error', rangemode = "tozero"))
plot_ly(x=train_df$yr_mo, y=(predict(rf_model, newdata = train_df)-train_df$n), type='scatter', mode='markers') %>% 
  layout(title = 'Plot of Random Forest Model Residuals vs yr_mo',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'RF Residuals', rangemode = "tozero"))

Test Metrics

SST_rf_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_rf_test <- sum((test_df$n - predict(rf_model, newdata = test_df))^2)
R_sq_rf_test <- 1-(SSE_rf_test/SST_rf_test)
R_sq_rf_test
## [1] 0.7582554
plot_ly(x=test_df$yr_mo, y=(predict(rf_model, newdata = test_df)-test_df$n), type='scatter', mode='markers') %>% 
  layout(title = 'Plot of Random Forest Model Residuals vs yr_mo for Test Data',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'RF Residuals', rangemode = "tozero"))

Our random forest model performed much better, giving us an R-squared value of 0.86 on training data and 0.76 on test data.

Classification and Regression Tree (CART)

Training

set.seed(12345)
min_leaf <- 3
min_split <- 3*min_leaf
cart_model <- rpart(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df, 
                    control = c(minsplit = min_split, minbucket = min_leaf, cp=0.01)) #default = 0.01
SST_CART_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_CART_train <- sum((train_df$n - predict(cart_model, newdata = train_df))^2)
R_sq_cart_train <- 1-(SSE_CART_train/SST_CART_train)
R_sq_cart_train
## [1] 0.9023338

Visualisation

data.frame(Variable_Importance = cart_model$variable.importance, 
           Variable_Importance_Pct_Tot = round(100*cart_model$variable.importance/sum(cart_model$variable.importance),0))
rpart.plot(cart_model)

plot_ly(x=train_df$yr_mo, y=(predict(cart_model, newdata = train_df)-train_df$n), type='scatter', mode='markers') %>% 
  layout(title = 'Plot of CART Model Residuals vs yr_mo',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'CART Residuals', rangemode = "tozero"))

Test Metrics

SST_cart_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_cart_test <- sum((test_df$n - predict(cart_model, newdata = test_df))^2)
R_sq_cart_test <- 1-(SSE_cart_test/SST_cart_test)
R_sq_cart_test
## [1] 0.65263
plot_ly(x=test_df$yr_mo, y=(predict(cart_model, newdata = test_df)-test_df$n), type='scatter', mode='markers') %>% 
  layout(title = 'Plot of CART Model Residuals vs yr_mo for Test Data',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'CART Residuals', rangemode = "tozero"))

With the CART model, the R-squared value on training data is 0.90, but it drops to 0.65 on test data.

Each model’s performance can be compared in the following plots.

Model comparison

Training Data

plot_ly(x=train_df$n, y=(predict(cart_model, newdata = train_df)), type='scatter', mode='markers', name='CART', alpha=0.3) %>%
  add_trace(x=train_df$n, y=(predict(rf_model, newdata = train_df)), type='scatter', mode='markers', name='RF', alpha=0.3) %>%
  add_trace(x=train_df$n, y=(predict(lm_model, newdata = train_df)), type='scatter', mode='markers', name='LM', alpha=0.3) %>%
  add_trace(x=c(0,500), y=c(0,500), type='scatter', mode='lines', name='Perfect', alpha=1) %>%
  layout(title = 'Plot of All Model Predictions vs Actual Values - Training Data',
         xaxis = list(title = 'Actual Value', range=c(0,700)),
         yaxis = list(title = 'Model Prediction', rangemode = "tozero"))

Test Data

plot_ly(x=test_df$n, y=(predict(cart_model, newdata = test_df)), type='scatter', mode='markers', name='CART', alpha=0.3) %>%
  add_trace(x=test_df$n, y=(predict(rf_model, newdata = test_df)), type='scatter', mode='markers', name='RF', alpha=0.3) %>%
  add_trace(x=test_df$n, y=(predict(lm_model, newdata = test_df)), type='scatter', mode='markers', name='LM', alpha=0.3) %>%
  add_trace(x=c(0,300), y=c(0,300), type='scatter', mode='lines', name='Perfect', alpha=1) %>%
  layout(title = 'Plot of All Model Predictions vs Actual Values - Test Data',
         xaxis = list(title = 'Actual Value', range=c(0,325)),
         yaxis = list(title = 'Model Prediction', rangemode = "tozero"))

Overall Model Performance Summary Table

kable(data.frame(Model_Type = c('Linear', 'Random Forest', 'CART'),
           R_sq_train = round(c(R_sq_lm_train, R_sq_rf_train, R_sq_cart_train), 2),
           R_sq_test = round(c(R_sq_lm_test, R_sq_rf_test, R_sq_cart_test), 2))) %>% 
  kable_classic(full_width = FALSE, html_font = "Cambria")%>%
  row_spec(2, background = c("yellow"))
Model_Type R_sq_train R_sq_test
Linear 0.55 0.08
Random Forest 0.86 0.76
CART 0.90 0.65

We can see that our random forest model has the best performance on the test data, so we select it and move forward. We will now estimate the monthly volume our auto body repair center can expect in the future, using our expected market share of 0.0316%.

Prediction

We continue here with the assumption that our repair shop operates under perfect competition, meaning it holds an equal share of demand, i.e. 0.0316%. With this assumption in place, we estimate the shop’s future monthly car volume.

Computation

First, we formally compute the market share of the shop.

number_of_repair_shops <- 3164
Shop_Market_Share <- 3164/(100*number_of_repair_shops)

Now, we aggregate the test data that we had set aside to obtain the actual monthly collision volume and the corresponding shop volume.

test_agg_df <- test_df %>%
  group_by(MONTH) %>%
  summarise(Actual_NYC_Collisions=sum(n), Actual_Shop_Volume=round(Shop_Market_Share*sum(n), 0))

Here, we create a dataset for the future year (2021).

temp1 <- data.frame(TIMESTEP = c(48:59))
temp2 <- data.frame(MONTH = c('01','02','03','04','05','06','07','08','09','10','11','12'))
temp3 <- data.frame(PERSON_AGE = c(15:100))
temp4 <- data.frame(PERSON_SEX = c('M','F'))
monthly_predictions_df <- temp1 %>%
  bind_cols(temp2) %>%
  full_join(temp3, by=character()) %>%
  full_join(temp4, by=character())

And finally, we make predictions.

monthly_predictions_df$predictions <- predict(rf_model, newdata = monthly_predictions_df)
monthly_predictions_agg_df <- (monthly_predictions_df %>% 
                                 group_by(MONTH) %>% 
                                 summarise(Predicted_NYC_Collisions=round(sum(predictions), 0)) %>%
                                 mutate(Predicted_Shop_Volume = round(Shop_Market_Share*Predicted_NYC_Collisions, 0)) %>% 
                                 left_join(test_agg_df, by='MONTH') %>% 
                                 mutate(YEAR = 2021) %>% 
                                 select(YEAR, MONTH, Actual_NYC_Collisions, Actual_Shop_Volume, Predicted_NYC_Collisions, Predicted_Shop_Volume)
                                 )   

kable(monthly_predictions_agg_df) %>% 
  kable_classic(full_width = FALSE, html_font = "Cambria")
YEAR MONTH Actual_NYC_Collisions Actual_Shop_Volume Predicted_NYC_Collisions Predicted_Shop_Volume
2021 01 9671 97 16482 165
2021 02 8432 84 15864 159
2021 03 10648 106 15574 156
2021 04 11501 115 12726 127
2021 05 13394 134 14655 147
2021 06 13745 137 14868 149
2021 07 12645 126 15234 152
2021 08 12473 125 15381 154
2021 09 12623 126 15488 155
2021 10 13097 131 15758 158
2021 11 11522 115 15324 153
2021 12 NA NA 15081 151

Plot of actual versus predicted demand volume

plot_ly(x=monthly_predictions_agg_df$MONTH, y=monthly_predictions_agg_df$Actual_Shop_Volume, 
        type='bar', name='Actual_Shop_Volume') %>% 
  add_trace(x=monthly_predictions_agg_df$MONTH, y=monthly_predictions_agg_df$Predicted_Shop_Volume, 
            type='bar', name='Predicted_Shop_Volume') %>% 
  layout(title = 'Plot of actual and predicted shop volume (Test Set)',
         xaxis = list(title = 'Months in 2021'),
         yaxis = list(title = 'Car Volume'))

Conclusion

Initial hypotheses

We can see that our model actually does quite a good job of predicting the shop volume compared to the actual volume the shop saw in 2021. We hypothesize that our model would have performed even better without the unforeseen consequences of COVID-19 and the subsequent increase in remote work availability.

Our next task is to start answering our secondary research objectives, including identifying key demographics for use in a marketing campaign and then measuring its cost and effect.

Proportion of crashes by gender

kable(train_df %>% 
        group_by(PERSON_SEX) %>% 
        summarise(n = sum(n)) %>% 
    arrange(PERSON_SEX)) %>% 
  kable_classic(full_width = FALSE, html_font = "Cambria")
PERSON_SEX n
F 313706
M 895649

Proportion of crashes by age group

kable(train_df %>% 
        mutate(age_group = 5*floor(PERSON_AGE/5)) %>%
        group_by(age_group) %>% 
        summarise(n = sum(n)) %>% 
        arrange(age_group)) %>% 
  kable_classic(full_width = FALSE, html_font = "Cambria")
age_group n
15 26719
20 113808
25 161005
30 155614
35 137152
40 122047
45 116100
50 111292
55 98774
60 73487
65 44235
70 25668
75 13053
80 6627
85 2720
90 834
95 197
100 23
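The `age_group` column above comes from `5*floor(PERSON_AGE/5)`, which maps every age to the lower edge of its five-year bracket:

```r
# Five-year binning: each age maps to the lower edge of its bracket
ages <- c(15, 19, 20, 24, 37, 100)
5 * floor(ages / 5)  # 15 15 20 20 35 100
```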

An analysis done by researchers in the UK found that, in road accident data reported in Great Britain, adjusted crash risk peaked at ages 21–29 and gradually declined as age increased. They also found a relationship between the age of the driver and the time of day an accident occurred, namely that teenage drivers were at a lesser risk of crashing than other age groups (Regev, Rolison, and Moutari). Moving forward, we plan to leverage our data to determine the demographics most likely to be involved in a car crash. We expect to use multiple linear regression with squared predictors standardized to mean 0 and standard deviation 1. We will then estimate the cost of a potential promotion according to the average clickthrough rates for search ads and display ads (1.91% and 0.35%, respectively) (Volovich). Finally, we will outline the potential increase in market share and apply these outputs to a costing model for our auto repair shop to derive profits and return on assets for future investments, potentially using linear regression.
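As a sketch of the planned demographic model — standardizing a predictor to mean 0 and standard deviation 1 before adding its square — here is a minimal base-R example on synthetic data (the variable names and generated values are illustrative, not drawn from the project data set):

```r
# Synthetic example: crash risk that peaks at a young age and declines after
set.seed(1)
age  <- runif(200, 16, 90)
risk <- 3 - 0.002 * (age - 25)^2 + rnorm(200, sd = 0.5)

age_std <- as.numeric(scale(age))         # standardized: mean 0, sd 1
fit <- lm(risk ~ age_std + I(age_std^2))  # squared standardized predictor
coef(fit)  # a negative coefficient on I(age_std^2) indicates peaked risk
```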

Future work

To recap, for the next milestone, we intend to:

  1. Refine our models by exploring more advanced models such as boosted trees,
  2. Perform additional data exploration, time-permitting, to get more insights from our data,
  3. Leverage our data to determine the demographics most likely to be involved in a car crash and the impact to the business of targeting these demographics with advertising,
  4. Refine our report by diving deeper into the existing literature and provide clearer ties between that research and our analysis, and
  5. Refine our report by addressing any feedback received from the graders of the progress report.

Works Cited

He, Songtao, et al. “Inferring High-Resolution Traffic Accident Risk Maps Based on Satellite Imagery and GPS Trajectories.” (2021): 11957–11965.
“Motor Vehicle Collisions - Crashes.” 4 Jul. 2022. <https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95>.
Motor Vehicles, New York State Department of. “Find a DMV Regulated Business.” 19 Jun. 2022. <https://process.dmv.ny.gov/FacilityLookup/?_ga=2.33023101.270841510.1656975631-932475252.1656975630&TSPD_101_R0=084c043756ab200049c13ac02223efa9441583bad77008d3b71790bcb79ba757be5b2aa10507b29f084bbc5efb143000172ce93dfec8420e0dc96ad9ed441e7f4a2448770215c0d5a6afcf5d46b82c8f7237478452ff354f386ddeb3a9b77d8a>.
Regev, Shirley, Jonathan J. Rolison, and Salissou Moutari. “Crash Risk by Driver Age, Gender, and Time of Day Using a New Exposure Methodology.” *Journal of Safety Research* 66 (2018): 131–140. <https://www.sciencedirect.com/science/article/pii/S0022437517307600>.
Thompson, Karl. “What Is Normal?” 3 Sep. 2018. 4 Jul. 2022. <https://revisesociology.com/2018/09/03/what-is-normal/>.
Volovich, Kristina. “What’s a Good Clickthrough Rate? New Benchmark Data for Google AdWords.” 4 Jul. 2022. <https://blog.hubspot.com/agency/google-adwords-benchmark-data>.

  1. We assume, of course, that all of their customers have normal psychological patterns (Thompson).↩︎